AN  EVALUATION  OF  SELECTED  ACOUSTIC  PARAMETERS 
FOR  USE  IN  SPEAKER  IDENTIFICATION 


By 
EDWARD  THOMAS  DOHERTY 


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  COUNCIL  OF 

THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL 

FULFILLMENT  OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 


UNIVERSITY  OF  FLORIDA 
1975 


ACKNOWLEDGEMENTS 

The author gratefully acknowledges counsel of Dr. Harry Hollien
who  provided  guidance  and  support  throughout  the  writer's  career  at 
the  University  of  Florida  and  throughout  the  course  of  this  study. 

The  author  is  also  pleased  to  acknowledge  the  constructive  comments 
of  his  supervisory  committee,  composed  of  Drs.  Teas,  Paige  and  Rothman. 


TABLE OF CONTENTS


ACKNOWLEDGEMENTS

LIST OF TABLES

ABSTRACT

CHAPTER

I     INTRODUCTION

II    PROCEDURE

III   RESULTS AND DISCUSSION

IV    SUMMARY AND CONCLUSIONS

APPENDIX

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH


LIST  OF  TABLES 


TABLE

1  Classification of Observations Based on Long-Term Power
   Spectra (LTS)

2  A Comparison of the Classification of Speakers by Means of
   Long-Term Speech Spectra Using Euclidean Distance,
   Cross-Correlation and Discriminant Analysis Techniques (N=50)

3  Classification of Observations Based on Speaking
   Fundamental Frequency (SFF)

4  Classification of Speakers Based on Two Forms of Speaking
   Time Vector (N=50)

5  Classification of Observations Based on Speaking Time (ST)

6  Classification of Observations Based on Various Combinations
   of the Limited Passband Long-Term Power Spectra (LTS),
   Speaking Fundamental Frequency (SFF) and Speaking Time (ST)


Abstract  of  Dissertation  Presented  to  the  Graduate  Council  of 

the  University  of  Florida  in  Partial  Fulfillment  of  the  Requirements 

for  the  Degree  of  Doctor  of  Philosophy 


AN  EVALUATION  OF  SELECTED  ACOUSTIC  PARAMETERS 
FOR  USE  IN  SPEAKER  IDENTIFICATION 


By 

Edward  Thomas  Doherty 


Chairman:   Harry Hollien
Major  Department:   Speech 


An investigation was conducted to examine the effectiveness of
certain acoustic and temporal properties of the speech signal in the
determination of a speaker's identity from his voice alone.
Specifically, the purposes of this research were to: (1) examine whether
long-term power spectra (LTS), speaking fundamental frequency (SFF),
and speaking time (ST) extracted separately from a speaker's vocal
output provide sufficient bases to make judgements of a talker's
identity; (2) test whether speaker identification rates can be
improved by using the three vectors (LTS, SFF and ST) in various
combinations; (3) test whether the specified vectors are sufficiently
robust to function in the presence of distortions -- limited passband,
stress or disguise; and (4) evaluate differences in correct
identification levels when various analytical procedures, i.e.,
Euclidean distance, cross-correlation or discriminant analysis, are used.

Readings of "An Apology for Idlers" were recorded from two groups
of speakers. The first group consisted of 50 college-age males who
performed the reading "as naturally as possible." The second group was
made up of 25 males, aged 25-45, who first read the passage in a
normal fashion but who also were required to read it while subjected to
stress and while attempting voice disguise.

Acoustic/temporal  analyses  were  performed  on  the  speakers'  utter- 
ances to  extract  the  LTS,  SFF  and  ST  vectors.   To  simulate  a  limited 
passband  for  LTS,  only  11  of  the  23  parameters  were  employed.   The  re- 
sults indicated  that:   (1)  the  LTS  vector  was  extremely  effective  for 
identifying  speakers  if  the  utterances  were  produced  normally,  (2)  SFF 
and  ST  were  far  less  effective,  (3)  combining  vectors  usually  improved 
correct identification levels, (4) when the speech was recorded while
the  talker  was  under  stress  or  attempting  a  disguise,  no  single  vector 
or  combination  adequately  differentiated  talkers,  and  (5)  under  the 
limited  conditions  of  this  study,  a  discriminant  analysis  is  a  more 
efficient  and  practical  method  of  determining  a  talker's  identity  from 
these  vectors  than  is  cross-correlation  or  Euclidean  distance. 

The  fact  that  no  single  vector  or  combination  was  capable  of 
maintaining  the  high  identification  rates  found  for  normal  productions 
under  conditions  of  stress  or  disguise  indicates  that  none  of  the  vec- 
tors measure  an  invariant  characteristic  of  an  individual's  speech. 


I 

INTRODUCTION 

The concept of determining the identity of a talker from acoustic
information alone is intuitively acceptable. For example, any
individual can recognize a large number of known speakers simply upon
hearing their voices. The speech sample in question may be produced
by a relative, politician, movie or television personality, etc., i.e.,
a person who previously has supplied sufficient material to provide the
listener with a set of discrimination criteria. In fact, several
investigators (Bricker and Pruzansky, 1966; LaRiviere, 1974; LaRiviere,
1975; Majewski, Hollien and Doherty, unpublished manuscript; Pollack,
Pickett and Sumby, 1954; Stevens, Williams, Carbonell and Woods, 1968)
have demonstrated that the aural mechanism can be used effectively to
identify talkers. In particular, Stevens et al. (1968) found that, for
single words, an aural technique was more effective than spectrography.
Bricker and Pruzansky (1966) reported an aural identification level of
98% when the listeners heard entire sentences from the speakers. In
another study in this area, Majewski et al. (unpublished manuscript)
demonstrated that listeners can reliably identify their co-workers under
normal and stress conditions and, apparently, their identification
capability is well above chance even when the talkers attempt to disguise
their voices.

In addition to aural speaker identification, several techniques
based on machine processing have been utilized. For example,
spectrography can be employed to produce an analog representation of
speech which, in turn, can be examined visually in order to permit the
identification of talkers. In fact, Kersta's (1962) claim that speakers
can be identified from simple visual examinations of spectrograms has
stirred a great deal of controversy. Because this technique has been
colloquially labeled "voiceprints," naive individuals are led to
believe it can approximate the precision of identification implied by
such a label (i.e., its implicit association with fingerprints). If
such were the case, all that would be necessary for the establishment
of an almost perfectly differentiating system -- based on a defined,
albeit subjective, set of parameters -- would be the refinement and
formulation of the procedure. However, as Hollien (1974) observed,
such efforts, even under controlled laboratory conditions, have not
reasonably shown that the "voiceprint" method is capable of producing
the order of identification accuracy claimed. Indeed, most other
scientists have repudiated the method; Vanderslice (1966) was among
the first to do so. While he was not able to test the specific
metrics used by Kersta (since they were not disclosed), he did show
spectrograms of the same utterances which looked alike but were from
different talkers as well as some which looked different although
produced by the same speaker. It would seem obvious from Vanderslice's
observations that "voiceprint" identification cannot be based on a
simple visual pattern matching scheme.

The voiceprint method has been attacked by others also; Bolt,
Cooper, David, Denes, Pickett and Stevens (1970) stress that, since
the identification task has not been defined by a set of objective
measures, the validity and reliability of this method of speaker
identification currently is untestable. Moreover, controversy exists
between the antagonists concerning the applicability of this type of
voice identification technique to "real-life cases." Thus, even if the
problems of interference with the acoustic signal are ignored, Bolt
et al. (1970) contend that false identification rates can reach levels
as high as 63%, at least for certain tasks and observer training.
Again, in 1973 these same authors questioned the feasibility of using
this visual technique for speaker identification. The research
conducted on "voiceprints" in the interim had not resolved the
problems. For example, they considered the error rates found by Tosi,
Oyer, Lashbrook, Pedrey, Nichol and Nash (1972) to be too large for
practical applications, i.e., the use of "voiceprint" identification
in investigations and judicial or forensic applications. In response,
Black, Lashbrook, Nash, Oyer, Pedrey, Tosi and Truby (1973) accused the
Bolt group of judging the technique without directly observing it and
further charged that the critics of the system disregarded factors
important to its success, e.g., examiner training, examination time,
sample size, etc. However, even after this discussion, Poza (1973) is
still of the opinion that before the "voiceprint" technique can be
used, the reliability of the method must be determined. Poza feels
that speaker characteristics are probably represented in spectrograms
and can be detected by examiners but, since "voiceprints" are to be
used in a forensic situation, those examiners must be trained on
materials that approximate practical cases. In any event, the argument
is unresolved and it should be noted that a substantial segment of the
scientific community is currently unwilling to accept "voiceprints" as
a viable identification system and, moreover, does not hold much hope
for its being refined sufficiently to become a reliable system. Even
Tosi (1974), a long term proponent of "voiceprints," has admitted that
a single completely reliable speaker identification technique is not
currently available and one may never be found.

Bolt and his associates were selected by the Technical Committee on

The "machine" approach to speaker identification is not confined
to the time/amplitude/frequency type of spectrography. Recently, other
researchers have studied this problem using other techniques of this
type, i.e., quantifiable acoustic parameters that are amenable to
machine processing. The spectral composition of an utterance can be
portrayed in forms other than spectrograms. For example, the output of a
spectral analysis device can be a set of relative numeric values which
reflect the acoustic energy as a function of frequency. Using such
spectral energy representations, Bricker, Gnanadesikan, Mathews,
Pruzansky, Tukey, Wachter and Warner (1971) reported achieving rather
high speaker recognition scores (90% or better). Gubrynowicz (1973)
and Kosiel (1973), using 15 and 10 talkers, respectively, found that
they could identify all of their speakers from a spectral analysis of
their vocal output. Majewski and Hollien (1974) also used a spectral
analysis technique (namely, long-term power spectra), where an average
of nearly 95% correct identification was achieved for two groups of 50
talkers -- rather large populations in speaker identification research.
Therefore, it seems likely that some portrayal of the distribution of
acoustic energy -- either instantaneous or long-term -- may be
incorporated in a speaker identification system.

Research on the speaker identification problem has included an
examination of the fundamental frequency of phonation as a viable
indicator (Atal, 1972; Hollien, Hollien and Majewski, 1974; Iles, 1972;
Sambur, 1973; Wolf, 1972). Thus far, it has not been shown to be as
effective as the measures of spectral composition. Sambur (1973)
reported that f0 contours, which represent changes in speaking
fundamental frequency, were useful but of somewhat less importance than
other measures, namely, formant location and bandwidth. However, Wolf
(1972), using an array of acoustic measures, found that the glottal
fundamental frequency could be used with some success. Therefore, it
appears that some measures of the frequency of the source wave of the
voice may prove to be relatively functional in speaker identification.

Temporal factors may also prove to be useful in speaker
identification. For example, speaking time (ST) may prove to have the
capacity to differentiate between talkers. In general, the term speaking
time is used to denote the portion of time during a complete utterance
that acoustic energy is present. While ST has not been tested as yet
as a discriminator of talkers, this characteristic of speech production
has been examined previously. For example, Holbrook (1973) reported
normative data on the total speaking time for four speech pathology
professors during an eight-hour day. He estimated that ST, the actual
time that a speaker is producing an acoustic signal, constituted 60%
of the total time to produce the utterance. The remaining 40% was
devoted to voiceless speech-time and pauses. From data such as these,
one might hypothesize that talkers differ in terms of the relative
duration of segments containing acoustic energy. Moreover, if the times
are recorded when a speaker is and is not producing an acoustic signal
during the reading of a particular passage, the intra-speaker
differences may be relatively small in comparison with inter-speaker
differences and thus provide an aid in identification.

There  are  many  acoustical  and  temporal  factors  that  might  be  used; 
however,  it  would  be  beyond  the  scope  of  any  single  study  to  examine  all 
of  the  parameters  or  sets  of  parameters  (vectors)  which  may  be  functional 
in  identifying  individuals  from  their  speech.   Nevertheless,  there  are 
several  sets  of  objectively  measured  parameters  that  are  of  special  in- 
terest in  that  they  hold  the  potential  of  eventually  being  incorporated 
into  some  sort  of  speaker  identification  system.   Briefly,  several  sets 
of  acoustic/temporal  measures  will  be  selected  and  their  power  in  a 
speaker identification system will be examined: (1) long-term speech
spectra (LTS), the distribution of acoustic energy as a function of
frequency, (2) speaking fundamental frequency (SFF), the
characteristics of a talker's glottal source wave and (3) speaking time
(ST), the amount of time phonation is present during the total
speaking time.
Long-Term  Power  Spectra  (LTS) 

As stated, Bricker et al. (1971), Gubrynowicz (1973), Kosiel (1973),
Majewski and Hollien (1974) and Hollien, Majewski and Hollien (1974)
have used long-term speech spectra. In particular, Majewski and Hollien
(1974) obtained speech samples from two groups: 50 Americans and 50
Poles. Four sets of comparisons were made for each group resulting in
percent correct identification for the Poles of 96, 98, 98 and 98, and
for the Americans of 94, 96, 84 and 90. The overall average for both
was 94.2%, a rather impressive identification rate for populations as
large as those used in this research. In the same study, the authors
reported correct identification scores for a limited passband, 315-3150
Hz. In this case, the scores were 82% for the Poles and 70% for the
Americans. Considering the reduction in correct identifications, it is
apparent that some of the information eliminated was useful in
speaker identification. However, the overall performance of LTS for
these relatively large groups demonstrates that this measure has the
potential to be included in a speaker identification system.
Speaking  Fundamental  Frequency  (SFF) 

The impression of the pitch of a speaker's voice is an intuitively
appealing characteristic providing a clue to his identity. For example,
Atal (1972) examined the feasibility of speaker recognition based on
pitch contours -- actually changes in SFF. Six utterances from each of
ten male talkers were normalized, i.e., adjusted to have the same
duration, and moment-to-moment changes in SFF were extracted. First, a
cross-correlation between pairs of contours was computed; the highest
mean correlation coefficient indicated that the test samples
constituted the product of one speaker. This analysis produced a score
of 70%. Later a minimum-distance classification technique (similar to
that of Hollien, Hollien and Majewski, 1974) was utilized and a correct
identification rate of 68% was obtained. This result was not
significantly different from that obtained from the correlation method.
Since pitch contours provide a measure of discriminability among
talkers, it is possible that other measures of fundamental frequency,
mean SFF (f0) and its variability (PS), may also be useful in
differentiating talkers. Indeed, Hollien, Hollien and Majewski developed
a technique based upon these measures which produced correct
identification rates ranging from 80.0 to 100.0% for three groups. Even
if, as in Atal's study, the measures are not exceptionally functional
as determiners of speaker identity when used alone, they may enhance
overall identification capability when combined with other parameters.
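
As an illustration of the two SFF measures named above, the following
minimal sketch computes a mean and a standard deviation from a track of
fundamental frequency estimates. The function name, the use of 0.0 to
mark unvoiced frames, and the sample values are assumptions made for
the example; the source does not state how these measures were
actually computed.

```python
import statistics


def sff_vector(f0_track_hz):
    """Return (mean SFF, standard deviation) over voiced frames.

    f0_track_hz holds one fundamental frequency estimate per frame in Hz.
    A value of 0.0 marks an unvoiced frame (an assumed convention).
    """
    voiced = [f for f in f0_track_hz if f > 0.0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.mean(voiced), statistics.stdev(voiced)


# Hypothetical f0 track: two unvoiced frames and four voiced frames
track = [0.0, 110.0, 120.0, 130.0, 0.0, 120.0]
mean_f0, spread = sff_vector(track)
```

Reporting the spread in Hz is the simplest choice; a study might
instead express variability in tones or semitones.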
Speaking  Time  (ST) 

While  it  has  been  noted  that  ST  has  not  been  examined  previously  as 
a  measure  upon  which  to  base  speaker  identification,  it  is  possible  that 
the  vector  could  be  rather  effective.  Holbrook's  (1973)  report  provides 
only  normative  data  on  a  group  of  professors.  However,  the  vector  could 
be  powerful  if  talkers  are  consistent,  i.e.,  the  proportion  of  time  that 
acoustic  energy  is  being  produced  is  stable  from  one  speech  sample  to 
another.  Further,  there  must  be  variability  among  subjects. 
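
One simple way to express the proportion described above is the
fraction of analysis frames in which the signal level exceeds a
threshold. The sketch below assumes frame levels in dB and a fixed
threshold; both the framing and the threshold value are illustrative
assumptions rather than details taken from this study.

```python
def speaking_time_ratio(frame_energy_db, threshold_db):
    """Fraction of frames whose level exceeds threshold_db."""
    if not frame_energy_db:
        raise ValueError("empty energy track")
    active = sum(1 for level in frame_energy_db if level > threshold_db)
    return active / len(frame_energy_db)


# Hypothetical frame levels (dB) for a short utterance
energies = [20, 55, 60, 58, 22, 18, 61, 63, 59, 21]
ratio = speaking_time_ratio(energies, threshold_db=40)
```

For the vector to be useful, a ratio of this kind would need to be
stable across samples from one talker and to vary between talkers.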
Vector  Combinations 

In addition to using each of the three vectors separately, as
discussed above, it is possible to develop a composite vector comprised
of all or part of the basic set, thereby selecting a parameter set that
will be most effective. For example, Sambur (1973) found that in order
of importance, formant location (F3 and F4 in vowels), bandwidth, and
mean speaking fundamental frequency are all effective features upon
which to make judgements of a speaker's identity. Further, he achieved
extremely low error rates -- as low as 0.003 -- by combining these
elements. Thus, while the speaker identification potential of each of
the vectors (LTS, SFF and ST applied separately) is of particular
interest, certain combinations may prove to be more effective when they
are combined into a composite vector -- even if one of the members
appeared to be a rather poor predictor when used alone. If one vector
is independent of the second (with which it is paired), then the
identification rates of the pair should show substantial improvement
over their separate identification rates. For example, it is probable
that SFF and ST are independent. During speech, a talker can initiate
and terminate glottal activity at any time and anywhere in his
phonational range. Therefore, one vector may not affect the other and
it is reasonable to expect much better results when SFF and ST are used
conjointly than when they are used alone. Conversely, combining vectors
may not improve speaker identification levels if little information is
present in one vector that is not contained in the other. In any case,
the interrelationships of the three vectors currently are unknown and,
therefore, their discriminating power when used together cannot be
predicted at this time.
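
A composite vector of the kind discussed above can be sketched as a
simple concatenation of the separate vectors. Because the parameters
are in different units (Hz, proportions), the example also standardises
each parameter across talkers. That standardisation step, the function
names, and the sample values are assumptions for illustration and are
not described in the source.

```python
import statistics


def composite(*vectors):
    """Concatenate per-talker vectors (e.g. SFF and ST) into one vector."""
    return [value for vector in vectors for value in vector]


def zscore_columns(rows):
    """Standardise each parameter (column) across all rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not divide by zero
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, spreads)]
        for row in rows
    ]


# Hypothetical values for three talkers: [mean SFF Hz, SD Hz] and [ST ratio]
sff = [[110.0, 8.0], [125.0, 12.0], [140.0, 10.0]]
st = [[0.58], [0.62], [0.66]]
combined = zscore_columns([composite(a, b) for a, b in zip(sff, st)])
```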
Distortions

To  properly  evaluate  any  identification  system,  some  attention 
must  be  paid  to  the  ability  of  the  system  --  and  its  several  compon- 
ents --  to  function  properly  in  the  presence  of  distortions  of  various 
classes.   Here,  the  term  distortion  is  meant  to  refer  to  any  factor 
that  may  produce  changes  in  the  speech  signal.   Distortions  may  arise 
either  in  the  production  or  transmission  of  the  acoustic  signal. 
During  transmission,  for  example,  the  acoustic  signal  may  be  distorted 
by  effects  of  limited  frequency  response  in  the  transmission  line.   The 
most  common  occurrence  of  the  filtering  of  speech  is  associated  with 
telephone  conversations.   While  the  filtering  quality  of  telephone  com- 
munications is  related  to  transmission  distance,  especially  from  the 
subscriber  to  a  control  office,  a  nominal  frequency  response  is  from 
300  to  3000  Hz.   The  obvious  effect  of  such  frequency  limiting  would  be 
to reduce the amount of information available to any vector, especially
one which is frequency dependent, thereby degrading the effectiveness
of that set of measures.

The  distortions  discussed  thus  far  are  related  to  one  class  of 
transmission  line  interference.   Additional  distortions  may  arise  in 
the  production  of  the  speech  signal.   For  example,  Davitz  and  Davitz 
(1959)  demonstrated  that  emotions  are  portrayed  in  the  acoustic  sig- 
nal.  In  this  study,  listeners  were  found  to  be  able  to  reliably  ident- 
ify the  intended  emotion  of  male  and  female  talkers  reading  the  alpha- 
bet.  Thus,  it  is  possible  that  such  alterations  by  a  speaker  may  inter- 
fere with  his  correct  identification,  i.e.,  an  individual  may  produce 
sufficiently  variant  speech  exemplars  when  emotionally  aroused  (as 
opposed  to  when  he  is  relaxed)  that  his  utterances  may  be  classified  as 
being from different talkers. Using perceptual and machine techniques,
Majewski et al. (unpublished manuscript) and Hollien, Majewski and
Hollien (1974), respectively, demonstrated that samples drawn from
speakers while they were under stress were correctly identified less
frequently than samples produced normally.

Implicit  in  the  discussion  of  the  effects  of  psychological  and 
physiological  factors  upon  the  production  of  speech  is  the  concept  that 
the  talkers  may  not  be  consciously  attempting  to  modify  that  signal. 
The  speaker  is  simply  trying  to  say  something  and  his  vocal  output  may 
reflect  the  pressure  of  an  interfering  stimulus.   However,  a  speaker 
also  can  exercise  extensive,  conscious  control  over  his  vocal  output. 
The  extent  to  which  a  speaker  can  actively  vary  the  acoustic  signal 
should  be  at  least  as  great  as  those  associated  with  emotionally  based 
distortions.   Thus,  a  talker,  by  disguising  the  voice,  should  be  able 
to increase confusion about his identity. Again, Majewski et al.
(unpublished manuscript) and Hollien, Majewski and Hollien (1974)
reported fewer correct identifications for speakers attempting disguise
than for normally produced samples. Further, these levels were lower
than those reported for the stress condition.

In  any  case,  there  are  a  variety  of  ways  to  distort  a  speech  sig- 
nal and  it  is  of  interest  to  determine  whether  the  selected  objective 
measures  are  sensitive  to  distortions  generated  in  transmission,  i.e., 
limited  passband,  and  in  production,  i.e.,  stress  or  disguise.   In 
other  words,  if  the  same  acoustic  parameters  used  to  analyze  normal 
productions  are  extracted  from  these  altered  samples,  how  will  their 
accuracy  in  discriminating  talkers  be  altered? 
Statistical  Analysis 

The  statistical  method  used  in  a  particular  piece  of  research  may 
have  some  bearing  on  the  results.   For  example,  Atal  (1972)  employed 
both  minimum-distance  and  cross-correlation  techniques  to  analyze  his 
data. Both methods should have yielded identical results but, in fact,
they produced slightly different scores (68% vs. 70% correct
identifications, respectively). In addition, Atal computed a moments of
pitch-period  distribution  for  the  same  data  and  obtained  78%  correct 
identifications.   Although  the  differences  for  the  three  approaches 
are  not  dramatic,  it  is  possible  that  other  methods  could  produce  sub- 
stantially different  levels  of  identification. 

In any case, a variety of statistical approaches are available
to assist in the analysis/interpretation of research of this type.
For example, Majewski and Hollien (1974) utilized a Euclidean distance
technique in which a distance is calculated from the location of an
observation to the location of each reference and the observation is
assigned to the closest reference. Zalewski, Majewski and Hollien (in
press) employed a cross-correlation statistical technique and showed a
slight improvement over the results of a Euclidean distance technique --
based on the same data -- used by Majewski and Hollien (1974). For
cross-correlations, an observation is classified as coming from the
talker whose reference utterance is most highly correlated with it. On
the other hand, Bricker et al. (1971) are of the opinion that a
discriminant analysis is an appropriate technique. In this case, a
generalized squared distance is computed based upon a covariance
matrix -- a slightly different approach from either the Euclidean
distance or cross-correlation approaches. Further, discriminant
analysis does permit the examination of the three vectors, LTS, ST and
SFF, under the various conditions.
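
The first two decision rules described above can be sketched as
follows. This is a minimal illustration under stated assumptions: one
reference vector per talker, Pearson correlation as the correlation
measure, and invented band levels. It is not a reconstruction of the
procedures used in the cited studies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def classify_by_distance(obs, references):
    """Assign obs to the talker whose reference is closest."""
    return min(references, key=lambda name: euclidean(obs, references[name]))


def classify_by_correlation(obs, references):
    """Assign obs to the talker whose reference correlates most highly."""
    return max(references, key=lambda name: correlation(obs, references[name]))


# Hypothetical band levels (dB) for two reference talkers
references = {
    "talker_A": [60.0, 58.0, 50.0, 42.0],
    "talker_B": [55.0, 57.0, 56.0, 48.0],
}
observation = [59.0, 58.5, 51.0, 43.0]
by_distance = classify_by_distance(observation, references)
by_correlation = classify_by_correlation(observation, references)
```

The two rules need not agree in general: correlation ignores overall
level differences between vectors, whereas Euclidean distance does not.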

Statement  of  the  Problem 
There  are  many  approaches  to  the  speaker  identification  problem; 
most  have  not  been  tested.   Neither  the  characteristics  of  the  voice 
that  make  identification  possible  nor  the  techniques  for  evaluating 
those  qualities  have  been  determined.   It  would  be  of  practical  benefit 
to  consider  several  sets  of  parameters  which  may  be  relatively  powerful 
in  discriminating  among  talkers.   Long-term  speech  spectra,  speaking 
time  and  speaking  fundamental  frequency  will  be  examined.   These  vectors 
will  be  studied  both  under  idealized  conditions  and  while  being  exposed 
to  possibly  contaminating  physical  and  psychological  factors.   LTS  will 
be subjected to filtering in order to determine the vector's
effectiveness under that type of distortion. All three basic vectors
will be examined when production distortions are present, i.e., stress
and disguise. The analysis system should not only make efficient use of
the data but also allow for the straightforward examination of any
subset of the parameters. Therefore, a discriminant analysis was used to
identify talkers from their speech alone and it permitted the
examination of the effect of any portion of a vector or any combination
of the vectors.
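
The idea of classifying by a generalized squared distance, while
allowing any subset of parameters to be examined, can be sketched as
below. The sketch assumes a pooled within-talker covariance matrix and
equal prior probabilities; the function names, the `keep` argument and
the data are hypothetical, and the statistical package actually used in
this study may differ in its details.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(groups):
    """Pooled within-talker covariance matrix over all talkers."""
    dim = len(next(iter(groups.values()))[0])
    total = [[0.0] * dim for _ in range(dim)]
    dof = 0
    for rows in groups.values():
        centre = mean_vector(rows)
        for row in rows:
            d = [x - c for x, c in zip(row, centre)]
            for i in range(dim):
                for j in range(dim):
                    total[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[value / dof for value in line] for line in total]


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def classify(obs, groups, keep=None):
    """Assign obs to the talker with the smallest generalized squared distance."""
    if keep is not None:  # examine only a subset of the parameters
        groups = {t: [[r[i] for i in keep] for r in rows] for t, rows in groups.items()}
        obs = [obs[i] for i in keep]
    cov = pooled_covariance(groups)

    def squared_distance(talker):
        diff = [x - c for x, c in zip(obs, mean_vector(groups[talker]))]
        return sum(d * w for d, w in zip(diff, solve(cov, diff)))

    return min(groups, key=squared_distance)


# Hypothetical data: [mean SFF in Hz, ST ratio], three samples per talker
groups = {
    "talker_A": [[110.0, 0.58], [112.0, 0.61], [108.0, 0.60]],
    "talker_B": [[130.0, 0.66], [127.0, 0.64], [131.0, 0.69]],
}
full_result = classify([111.0, 0.60], groups)
subset_result = classify([111.0, 0.60], groups, keep=[0])
```

Passing `keep` restricts both the references and the observation to the
chosen parameter indices, which mirrors the examination of a portion of
a vector or a combination of vectors.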

Currently,  an  adequate  technique  of  speaker  identification  does 
not  exist  and  the  analytic  method  to  select  the  correct  talker  has  not 
been  determined.   It  would  seem  that  research  directed  at  examining 
voice  qualities  and  analysis  procedures  is  warranted.   Therefore,  this 
research  directly  addresses  itself  to  four  specific  questions: 

A.  Are  long-term  power  spectra,  speaking  fundamental  frequency 
and  speaking  time  sufficiently  powerful  to  be  used  singly  to 
identify  a  talker  from  his  speech? 

B.  Are  these  parameters  effective  measures  upon  which  to  base 
judgements  of  speaker  identification  when  used  in  various 
combinations?

C.  Are  these  specified  parameters  sufficiently  robust  to  func- 
tion in  the  presence  of  distortions  --  limited  passband, 
stress  or  disguise? 

D.  Which statistical procedure -- Euclidean distance,
cross-correlation or discriminant analysis -- provides the highest
correct speaker identification levels (as indicated by comparisons
based on the full-band LTS vector for the group of 50 speakers)?


PROCEDURE 

The present experimentation addresses two points: the assessment
of several acoustic parameter sets functioning in a speaker
identification task, and the most effective method to analyze the data.

The primary focus of this research is directed at the capability
of sets of selected acoustic measures -- used alone or in concert -- to
detect the identity of a talker from a number of speakers who may be
talking normally, under stress or while attempting a disguise.
Identification  Parameters 

Three  sets  of  parameters  are  examined  in  terms  of  their  contribu- 
tion to  speaker  identification:  long-term  power  spectra,  speaking  fun- 
damental frequency  and  speaking  time.  The  evaluations  of  the  parameter 
sets  are  conducted  to  determine  the  effectiveness  of  each  one  as  a  dis- 
criminator a)  if  that  set  were  the  sole  basis  for  identification  and  b) 
their  potential  when  used  conjointly. 

Long-Term Power Spectra.¹ LTS is a measure of the distribution of
acoustic energy during a set time interval. The long-term spectral
characteristics for all speakers were extracted from the recorded speech
samples by means of a General Radio 1921 Real-time Analyzer. These
analyses were performed in 23 one-third octave bands covering the
frequency range from 80 to 12,500 Hz. In order to minimize the influences
of the speech content upon the speech spectrum, the longest integration
time available (i.e., 32 sec.) was used. The sampling rate for this value
of integration time was 32 samples per second; four speech samples, each
of 32-sec. duration, were analyzed (serially) for each subject.

¹ While the description of long-term power spectra specifies a vector
of 23 parameters, virtually all analyses are conducted with only 11
of these values. Therefore, unless specifically noted as otherwise,
all references to LTS apply to the subset of 11 measures.

The  values  extracted  during  the  spectral  analyses,  expressed  in  dB 
levels  for  each  of  23  frequency  bands,  were  printed  on  paper  tape  by 
means  of  an  MDS  800  Printer.   Since  the  printing  time  lasted  only  four 
seconds,  it  was  possible  to  analyze  the  samples  without  the  necessity  of 
stopping  the  input  tape  recorder  after  each  sample.   For  both  groups,  a 
total  of  300  speech  spectra  (each  LTS  vector  expressed  by  23  parameter 
values)  were  obtained  for  further  processing. 

An  intermediate  data  processing  step  was  carried  out  on  an  IBM 
370/165  computer.   Specifically,  the  data  for  each  speech  sample  were 
normalized  in  order  to  equalize  the  overall  levels  of  the  samples.   An 
arbitrary  total  power  level  was  used;  i.e.,  50  dB  re:  the  full  frequency 
band  under  investigation.   In  other  words,  the  energy  measured  in  each 
one-third  octave  band  is  recorded  relative  to  50  dB  which  represents 
the  energy  present  in  the  total  signal. 
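
The level normalization described above can be illustrated with a short Python sketch. It is not the program that was run on the IBM 370/165; it assumes that the total level is the power sum of the one-third octave band levels and that normalization is a uniform shift in dB. The function names and the 50 dB default argument are hypothetical conveniences.

```python
import math


def total_level_db(band_levels_db):
    """Overall level (dB) obtained by summing the band powers (assumed definition)."""
    total_power = sum(10.0 ** (level / 10.0) for level in band_levels_db)
    return 10.0 * math.log10(total_power)


def normalize_lts(band_levels_db, target_total_db=50.0):
    """Shift every band by the same offset so the overall level equals the target."""
    # A uniform dB offset rescales all bands by one linear factor, so the
    # shape of the spectrum (differences between bands) is left unchanged.
    offset = target_total_db - total_level_db(band_levels_db)
    return [level + offset for level in band_levels_db]
```

Because only a constant is added to every band, two recordings of the same talker made at different gains would yield the same normalized vector under these assumptions.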

Speaking  Fundamental  Frequency.   The  second  set  of  parameters, 
SFF,  are   measures  of  the  rate  of  vibration  of  the  vocal  folds,  i.e.,  the 
lowest  frequency  in  the  glottal  wave  that  contains  energy.   Two  measures 
were utilized -- mean speaking fundamental frequency (f0) and pitch
sigma (PS), i.e., standard deviation of the distribution. Mean
fundamental frequency measures of each subject's reading were obtained
using the Fundamental Frequency Indicator (FFI-6), a digital readout f0
tracking device consisting of a group of successive low-pass filters
with cut-offs at half-octave intervals coupled with high-speed
switching circuits which are controlled by a logic system. FFI produces a
string of pulses -- each pulse marking the boundary of a fundamental
period from complex speech waves -- which are delivered to a Digital
Equipment Corporation PDP-8 computer. An interval clock marks the
time  from  pulse  to  pulse  and  these  values  are  processed  digitally  to 
yield  (among  other  data)  the  geometric  mean  frequency  and  standard 
deviation  of  the  frequency  distribution.   The  importance  of  SFF  is 
tested  using  discriminant  analysis.   This  vector  will  be  used  alone 
and  in  combination  with  other  vectors  to  test  its  value  in  identifying 
speakers. 
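
The conversion from period-marking pulses to the two SFF parameters can be sketched as follows. This is an illustration only, not the PDP-8 program: it assumes the pulse times are available in seconds and that pitch sigma is expressed in Hz (the unit is not stated in this section). The function name `sff_vector` is hypothetical.

```python
import math


def sff_vector(pulse_times_s):
    """Return (geometric mean f0 in Hz, standard deviation of f0 in Hz)."""
    # Each interval between successive pulses is taken as one fundamental period.
    periods = [t1 - t0 for t0, t1 in zip(pulse_times_s, pulse_times_s[1:])]
    freqs = [1.0 / p for p in periods if p > 0]
    if len(freqs) < 2:
        raise ValueError("at least two valid periods are needed")
    # Geometric mean: exponential of the mean log-frequency.
    geo_mean = math.exp(sum(math.log(f) for f in freqs) / len(freqs))
    # Sample standard deviation of the frequency distribution.
    mean = sum(freqs) / len(freqs)
    variance = sum((f - mean) ** 2 for f in freqs) / (len(freqs) - 1)
    return geo_mean, math.sqrt(variance)
```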

Speaking  Time.   Finally,  ST,  which  measures  the  amount  of  time 
any  acoustic  energy  is  present  during  a  total  utterance,  was  the  third 
parameter  studied.   As  used  in  this  research,  this  vector  also  is  com- 
posed of  two  parameters.   The  first  ST  parameter  measures  the  total 
time  that  an  acoustic  signal  is  present  during  an  individual's  total 
utterance.   In  this  case,  time  measures  are  extracted  for  each  of  the 
four  32-second  segments  of  the  reading  passage.   The  second  (parameter) 
is  a  measure  of  the  talker's  rate,  i.e.,  the  amount  of  speech  material 
completed  during  a  32-sec.  segment.   Initially,  it  is  unclear  whether 
the  number  of  words  or  the  number  of  phonemes  produced  would  produce 
the  more  stable  measure  of  a  talker's  output.   Therefore,  both  word 
and  phoneme  counts  per  segment  were  recorded  and  tested  to  verify  if 
one  was  more  appropriate  to  the  speaker  identification  task,  with  the 
better one being used in all subsequent analyses.

To obtain the first measure, a special device has been developed
which  has  two  outputs:   1)  a  1  kHz  continuous  signal  that  is  produced 
throughout  the  duration  of  the  utterance  and  2)  a  1  kHz  pulse  train 
which  is  present  whenever  acoustic  energy  exceeds  a  threshold.   By 
feeding  these  signals  to  a  pair  of  electronic  counters,  the  total  time 
of  the  utterance  and  the  amount  of  time  that  phonation  is  produced  can 
be  measured  in  milliseconds  (ms).   The  second  parameter  was  obtained  by 
recording  the  first  and  last  utterance  in  a  segment  and  then  counting 
the  number  of  words  or  phonemes  produced  in  that  segment.   As  with  the 
previous  set  of  parameters,  similar  procedures  were  employed  to  deter- 
mine their  value  in  differentiating  talkers  and  were  tested  singly  and 
in  combination  with  other  parameter  sets. 
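
The two-counter measurement can be mimicked in software. The sketch below is a hypothetical digital analogue of the device, not a description of its circuitry: it assumes a sampled amplitude envelope, and it counts one tick per sample for the whole utterance and one tick per sample whose level exceeds the threshold.

```python
def speaking_time_ms(envelope, threshold, sample_rate_hz):
    """Return (total utterance time, time above threshold), both in ms."""
    ms_per_sample = 1000.0 / sample_rate_hz
    # First counter: runs for the full duration of the utterance.
    total_ms = len(envelope) * ms_per_sample
    # Second counter: runs only while acoustic energy exceeds the threshold.
    active_ms = sum(1 for level in envelope if level > threshold) * ms_per_sample
    return total_ms, active_ms


def st_vector(envelope, threshold, sample_rate_hz, phoneme_count):
    """Two-parameter ST vector: time above threshold and phonemes per segment."""
    _, active_ms = speaking_time_ms(envelope, threshold, sample_rate_hz)
    return [active_ms, phoneme_count]
```

With an envelope sampled at 1 kHz, each sample corresponds to one millisecond, which parallels the 1 kHz counting signals described above.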

Speaker Populations

The  speaker  population  utilized  in  this  research  was  composed  of 
two  groups.   The  first  group  contained  50  college-age  male  talkers  from 
the  University  of  Florida.   They  were  adult  speakers  of  English,  aged 
18-25  years,  and  demonstrated  no  observable  speech  or   voice  problems. 
Generally,  such  talkers  might  be  expected  to  be  quite  homogeneous  in 
their  speech  patterns,  e.g.,  rate,  dialect,  etc.,  and  present  a  reason- 
ably difficult  population  from  which  to  correctly  identify  members. 
Further,  they  produced  only  samples  of  their  normal  everyday  speech. 

The second group of subjects was selected from a population of
more mature males. In this case, 25 normal speakers of American
English from approximately 25 to 45 years of age were included. Members
of this smaller, more variable group should be more readily
identifiable from their speech -- a premise that was tested. In addition,
these subjects supplied -- along with their normal productions --
samples that were distorted in two ways, i.e., they were produced under
stress and as the subjects attempted a disguise; these speaking
conditions will be described below.

Speech Material

All  subjects  read  a  modernization  of  Robert  Louis  Stevenson's  "An 
Apology  for  Idlers"  (see  Appendix);  this  passage  takes  about  2.5  min- 
utes to  read.   In  order  to  gather  samples  of  normal  speech  production, 
talkers  were  instructed  to  read  the  passage  "as  naturally  as  possible." 
Additional  readings  of  the  essay  were  requested  from  one  of  the  two 
groups  of  the  subjects  while  under  stress  and  while  attempting  a  dis- 
guise.  The  stress  was  induced  by  randomly  applying  a  mild,  varying 
electric  shock  during  the  subject's  reading.   For  the  disguised  voice, 
talkers  were  allowed  to  modify  their  voices  in  any  way  except  through 
the use of a "foreign dialect" or by whispering.

Signal Distortions

The  speech  samples  used  in  this  study  are  "ideal,"  i.e.,  they  were 
recorded  with  extreme  care  under  laboratory  conditions  without  observ- 
able distortions.   Such  controls  are  necessary  if  one  is  to  test  the 
best possible outcome from the speaker identification system -- a
measure of considerable interest -- since the primary focus is directed at
the maximum discriminant capability of the vectors. However, the
system's effectiveness in the presence of typical distortions is also a
critical measure. It is not within the scope of this project to
examine the identification vectors under very many conditions of distortion
or field situations but, considering the criticism of Bolt et al. (1970)
of the "voiceprint" technique (i.e., that consideration had not been
accorded to practicalities), some attempt will be made to evaluate
the ability of the system to function in less than ideal conditions.
For  this  reason,  LTS  was  subjected  to  both  production  and  transmission 
distortions  in  order  to  obtain  an  indication  of  the  magnitude  of  change 
in  the  system's  correct  identification  score.   Specifically,  LTS  was 
subjected  to  a  simulated  system  distortion,  i.e.,  a  limited  passband 
of  315-3150  Hz.   The  simulation  is  accomplished  by  removing  the  six 
parameters  which  contain  the  information  below  315  Hz  and,  similarly, 
by  removing  the  six  parameters  which  contain  the  information  above 
3150 Hz. The result is an LTS vector of 11 rather than 23 parameters.
ST  should  be  insensitive  to  the  effects  of  filtering  of  this  type  -- 
as  would  the  SFF  data.   Accordingly,  LTS  was  the  only  parameter  sub- 
jected to  filtering.   However,  all  three  parameters  were  exposed  to 
production  distortions,  i.e.,  stress  and  disguise.   Finally,  all  vec- 
tors were  subjected  singly  and  in  combination  to  the  respective  dis- 
tortions. 
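
The passband simulation amounts to selecting a subset of the 23 band levels. The Python sketch below assumes that the analyzer's bands follow the standard nominal one-third octave center frequencies from 80 to 12,500 Hz; that list and the function name `limit_passband` are illustrative assumptions, although the resulting counts (six bands removed below 315 Hz, six above 3150 Hz, 11 retained) agree with the description above.

```python
# Nominal one-third octave center frequencies (Hz), 80 to 12,500 Hz: 23 bands.
BAND_CENTERS_HZ = [
    80, 100, 125, 160, 200, 250,
    315, 400, 500, 630, 800, 1000, 1250, 1600, 2000, 2500, 3150,
    4000, 5000, 6300, 8000, 10000, 12500,
]


def limit_passband(lts_vector, low_hz=315, high_hz=3150):
    """Keep only the band levels whose center frequency lies in the passband."""
    if len(lts_vector) != len(BAND_CENTERS_HZ):
        raise ValueError("expected one level per one-third octave band")
    return [
        level
        for center, level in zip(BAND_CENTERS_HZ, lts_vector)
        if low_hz <= center <= high_hz
    ]
```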

Statistical Analysis

A  discriminant  analysis  technique  was  used  to  identify  speakers. 
Fundamentally, discriminant analysis uses measures obtained from known
classes -- speakers in this case -- in order to determine a
classification criterion. Then, additional observations are compared to the
criterion for each class and each is classified in the set which it most
closely resembles. Basically, this technique is one of pattern
matching. Obviously, the classification criteria for the test sample must
be  present  --  i.e.,  the  unknown  speaker  must  be  represented  in  the  set 
of  known  talkers  --  because  the  discriminant  analysis  classifies  every 
test  sample  into  some  category,  i.e.,  the  best  match. 

Specifically, the Statistical Analysis System (SAS) discriminant
analysis package was employed. Since the computation of the
discriminant function requires two or more sets of measurements from each
group  (speaker)  to  establish  a  criterion  for  those  groups,  the  second, 
third,  and  fourth  32-second  segments  of  the  normally  rendered  reading 
were  used  to  describe  the  characteristics  of  the  talkers.   For  both 
groups  of  talkers,  the  first  segment  of  the  normal  reading  was  treated 
as  the  "unknown."   For  the  second  group  (of  25  talkers)  in  addition  to 
the  normal  sample,  the  first  32-second  segment  of  the  "stress"  and  "dis- 
guise" readings  were  also  treated  as  "unknowns."   In  all  cases,  the 
first  segment  constituted  the  test  and  the  three  remaining  segments 
were used to establish the reference (1 vs 2,3,4). All of the
observations defined as "unknowns" were classified according to the
discriminant functions. A "correct identification" was made when an
observation was matched with the remaining samples from the same speaker
and  a  total  number  of  correct  identifications  was  obtained  for  each 
condition.   After  converting  the  totals  to  percent  correct  identifi- 
cation, the  effectiveness  of  each  vector  was  examined  both  as  a  single 
vector  and  in  combination  and  under  the  specific  distortions  applied 
to  it. 

The  discriminant  analysis  also  performs  a  posterior  classifica- 
tion of  the  reference  samples,  i.e.,  the  observations  used  to  define 
each  talker  are  subsequently  classified  as  though  they  were  "unknowns." 
The  level  of  correct  classification  of  the  reference  observations  is 
indicative  of  the  expected  level  of  performance  of  that  vector  in 
properly  identifying  the  test  observations.   In  other  words,  if  the 
three  segments  used  as  references  are  rather  similar,  the  "knowns" 
should be properly classified. Further, when the test samples are
correctly classified, we can assume with some confidence that it is
not a chance occurrence.
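
The test/reference scheme and the posterior classification can be sketched in Python with NumPy. This is not the SAS procedure; it is a stand-in that assumes a linear discriminant rule with a pooled within-speaker covariance matrix (each observation is assigned to the speaker whose reference mean is nearest in Mahalanobis distance). The function names are hypothetical.

```python
import numpy as np


def fit_discriminant(reference, labels):
    """Estimate speaker means and the pooled within-speaker precision matrix."""
    reference = np.asarray(reference, dtype=float)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    means = np.array([reference[labels == s].mean(axis=0) for s in speakers])
    # Center every reference observation on its own speaker's mean.
    centered = reference - means[np.searchsorted(speakers, labels)]
    dof = len(reference) - len(speakers)
    pooled = centered.T @ centered / dof
    # The pseudo-inverse guards against a singular estimate from few segments.
    return speakers, means, np.linalg.pinv(pooled)


def classify(model, observations):
    """Assign each observation to the best-matching (nearest) speaker."""
    speakers, means, precision = model
    observations = np.atleast_2d(np.asarray(observations, dtype=float))
    diffs = observations[:, None, :] - means[None, :, :]
    # Squared Mahalanobis distance from every observation to every speaker.
    d2 = np.einsum("ijk,kl,ijl->ij", diffs, precision, diffs)
    return speakers[np.argmin(d2, axis=1)]


def percent_correct(predicted, actual):
    """Percentage of observations assigned to the correct speaker."""
    return 100.0 * np.mean(np.asarray(predicted) == np.asarray(actual))
```

Classifying the reference rows themselves gives the posterior classification of the "knowns"; classifying the held-out first segments gives the identification of the "unknowns". As in the text, every observation is forced into some class, so the rule is meaningful only when the unknown talker is among the references.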


RESULTS AND DISCUSSION

Several  approaches  have  been  incorporated  in  this  study,  i.e., 
various  acoustic  measures  and  analysis  techniques  as  well  as  distor- 
tions of  the  speech  signal.   The  data  for  each  of  the  single  param- 
eters are   presented  in  toto  followed  by  the  results  for  these  vec- 
tors used  in  combination.   Each  of  these  sections  contains  the  re- 
sults of  the  posterior  classification  of  the  reference  (or  "known") 
samples  which  reflect  the  relative  within  and  between  talker  vari- 
ability.  The  posterior  classifications  of  the  "known"  talkers  are 
listed  because  they  provide  some  insight  into  the  stability  of  the 
groups  defined  by  the  vector.   Following  the  presentation  of  that 
data,  the  information  concerning  the  classification  of  the  test  sam- 
ples is  presented.   Finally,  identification  rates  in  the  presence  of 
speech  distortions  are  reported.   In  the  case  of  the  LTS  vector,  two 
additional  factors  are  considered,  namely,  the  effect  of  filtering  on 
speaker identification and comparison of the statistical approach
employed to classify the talkers.

Long-Term Power Spectra (LTS)

Table  1  lists  the  correct  identification  rates  for  the  posterior 
classifications  performed  on  both  groups.   Both  LTS  vectors,  fullband 
and  filtered,  demonstrate  rather  high  correct  identification  rates  for 
the posterior classifications. Therefore, the correct classification
of the test samples should be expected to be very high whenever either
of those vectors is used.

Table 1. Classification of Observations Based on Long-Term Power
Spectra (LTS).

                                    Correct Identification (in %)

                                  Group A            Group B
                                   N=50               N=25
Experimental Condition            Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"
  LTS (fullband)                   100.0     100.0
  LTS (passband)                    98.0     100.0

Classification of "Unknowns"
  LTS (fullband)                   100.0     100.0     72.0     24.0
  LTS (passband)                    76.0      80.0     60.0     20.0

Specifically,  the  fullband  LTS  vector  achieved  the  100.0%  levels 
of  correct  speaker  identification  for  the  normal  productions  of  both 
populations.   The  23  parameters  of  this  vector  apparently  describe 
patterns  for  each  talker  that  are  unique  --  at  least  for  differenti- 
ating numbers  of  talkers  as  large  as  50  when  no  distortions  are  present. 

Some  concern  exists  that  such  a  high  identification  level  may  have 
resulted  from  a  fortuitous  selection  of  the  segments  to  be  used  as 
tests and references. Majewski and Hollien (1974), aware of this
problem, examined speaker identification levels for four combinations of
test and reference sets, as follows: a) the fourth sample vs the mean
of the first three samples (4-1,2,3), b) the mean of the first three
samples vs the fourth sample (1,2,3-4), c) the mean of the last two
samples vs the mean of the first two samples (3,4-1,2) and d) the mean
of the first two samples vs the mean of the last two samples (1,2-3,4).
Their  results  --  using  LTS  with  the  group  of  50  speakers  --  showed 
that,  while  the  identification  scores  varied  somewhat  (from  84  to  96%), 
all  combinations  produced  rather  high  levels  of  speaker  identification. 
Therefore, some change in the scores may be expected if various test
and reference segments are used, but any combination containing at
least two segments used as references should produce comparable
levels of identification.

LTS  was  also  examined,  for  this  investigation,  to  see  if  it  would 
continue  to  be  a  viable  means  of  identifying  talkers  if  it  were  sub- 
jected to  filtering  of  the  type  that  may  be  encountered  in  telephone 
communications.   In  other  words,  could  correct  identifications  of 
speakers be made if only the spectral information from 315-3150 Hz
were available rather than from 80-12,500 Hz? To simulate such a
condition,  12  of  the  parameters  were  removed  from  the  LTS  vector.   As 
can  be  seen  in  Table  1,  the  filtered  condition  (i.e.,  11  parameters), 
showed  a  reduction  in  the  correct  classification  capability  of  24.0%, 
from  100.0%  for  the  fullband  to  76.0%  for  the  limited  one.   The  fact 
that  about  three-fourths  of  the  talkers  were  still  properly  identified 
is  encouraging  but  that  level  of  accuracy  is  unacceptable  for  most 
practical  applications.   The  same  comment  is  applicable  to  the  smaller 
group  since  its  accuracy  was  reduced  from  100.0%  to  80.0%. 

Production  distortions,  i.e.,  stress  and  disguise,  have  a  marked 
impact  on  speaker  identification  rates  at  least  for  this  identifica- 
tion parameter.   Even  the  LTS  (fullband)  vector  which  functioned  flaw- 
lessly for  normally  produced  speech  could  not  adequately  differentiate 
among  speakers.   While  a  score  of  72.0%  is  the  highest  achieved  for 
any  vector  or  combination  for  the  stress  samples  and  24.0%  is  among  the 
highest  for  disguised  speech,  both  distortions  have  an  obvious  effect 
on  speaker  identification.   Under  conditions  of  stress,  LTS  (fullband) 
maintains  a  moderate  level  of  performance.   The  remaining  LTS  vector, 
already  incorrectly  classifying  talkers  who  had  produced  normal  utter- 
ances, is  even  less  effective  with  the  stress  and  disguise  conditions 
providing  correct  identification  rates  of  60.0%  and  20.0%,  respectively. 

Of  particular  interest  in  this  study  is  a  comparison  of  the  three 
data  analysis  techniques  used  in  the  evaluation  of  the  effectiveness  of 
LTS  in  speaker  identification:   cross-correlation,  Euclidean  distance 
and discriminant analysis. Previously, a special program had been
written to classify talkers based on the shortest distance from a test
sample to a reference sample in a polydimensional Euclidean space
(Majewski and Hollien, 1974). The discriminant analysis portion of the
SAS package was also used. The results of both statistical methods are
listed in Table 2, as are the scores for the cross-correlation technique
employed by Zalewski et al. (in press). For both conditions, i.e.,
using  11  or  23  of  the  parameters  in  the  LTS  vector,  the  SAS  discriminant 
analysis  program  correctly  classified  more  talkers  than  the  Euclidean 
distance  or  cross-correlation  techniques.   Cross-correlations  produced 
the  same  results  as  were  found  using  Euclidean  distance.   In  this  case, 
discriminant  analysis  appears  to  be  a  more  effective  method  of  identi- 
fying talkers  from  the  spectral  composition  of  their  speech.   Based  on 
the fact that every investigator cited (Bricker et al., 1971; Gubrynowicz,
1973; Kosiel, 1973; Majewski and Hollien, 1974 and Hollien, Majewski
and Hollien, 1974) observed levels of speaker identification in excess
of 90%, it would appear that the spectral composition of an individual's
voice contains substantial clues to his identity.
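
The two simpler matching rules can be sketched in a few lines of Python. This is not the program of Majewski and Hollien nor the procedure of Zalewski et al.; it assumes that each talker is represented by one reference vector (for example, the mean of the reference segments) and that "cross-correlation" means the zero-lag Pearson correlation between spectra. All names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distance between two LTS vectors in the multidimensional band space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation(a, b):
    """Zero-lag Pearson correlation between two LTS vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def identify(test, references, method="euclidean"):
    """Return the speaker whose reference vector best matches the test vector."""
    if method == "euclidean":
        return min(references, key=lambda s: euclidean_distance(test, references[s]))
    if method == "correlation":
        return max(references, key=lambda s: correlation(test, references[s]))
    raise ValueError("method must be 'euclidean' or 'correlation'")
```

Unlike discriminant analysis, neither rule uses the within-talker variability of the reference segments, which is one possible reason for the difference in scores reported in Table 2.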

Again, it is possible that the high scores observed for LTS may
be related to the particular test/reference combinations employed.
In every other procedure in which discriminant analysis was used, the
first 32-sec. segment was the test and the remaining three segments
constituted the reference (1-2,3,4). However, in order to make the
results of the discriminant analysis technique directly comparable with
the others, the analysis was conducted again with the fourth segment
being treated as the test sample (4-1,2,3). The results are contained
in Tables 1 and 2; for the filtered vector, a 76.0% identification
level was achieved when the first segment was the test (1-2,3,4) and
84.0% in the other case (4-1,2,3). In addition, a similar test was
conducted with a group of 25 talkers. The levels of correct
identification were 80.0% (see Table 1) when the test sample was the first
segment and 92.0% when the fourth segment was used. For both groups,
there was some improvement when the last segment was considered to be
the test; however, the levels of identification indicate that reasonably
similar results would be obtained for any single test sample vs the
three remaining segments.

Table 2. A Comparison of the Classification of Speakers by Means of
Long-Term Speech Spectra Using Euclidean Distance, Cross-
Correlation and Discriminant Analysis Techniques (N=50).

                                Correct Identification (in %)

Classification Technique          Fullband     Filtered

Euclidean Distance                   96.0        70.0
Cross-Correlation                    96.0
Discriminant Analysis¹              100.0        84.0

¹ In this case, only to be consistent with the approach used by the
other authors, the first, second and third 32-second samples were
used as the reference and the fourth was used as the test sample
(4-1,2,3).

Speaking Fundamental Frequency (SFF)

In all cases where SFF is employed in the discriminant analysis,
the number of speakers is reduced from 50 to 43 for Group A and from
25 to 20 for Group B because the SFF vector could not be extracted for
all subjects. FFI was not able to extract reliable measures of
fundamental frequency in several cases because there was excessive
background noise or weak signals on the recordings.

Table  3  indicates  that  the  posterior  classification  of  the  ref- 
erence samples,  79.1%  for  the  larger  group  and  68.3%  for  the  smaller 
one,  is  well  below  the  level  achieved  by  the  LTS  vector.   On  this 
basis,  SFF  would  not  be  expected  to  be  as  effective  in  discriminating 
speakers  as  would  LTS.   Further,  the  posterior  classification  of  the 
large  group  is  higher  than  for  the  small  one  indicating  that  for  SFF, 
the  within  talker  variability  was  less  and/or  the  between  talker  vari- 
ability was  greater  for  the  large  group.   This  finding  was  unexpected 
since  the  talkers  in  the  group  of  50  were  college  students  who  were 
18-25  years  of  age  and  the  group  of  25  ranged  in  age  from  25  to  45. 
In this regard, Hollien and Shipp (1972) have shown that the mean
fundamental frequency of males drops from 120 Hz to 107 Hz from the
ages of 25 to 45 thus indicating that there should be greater
differences between members of Group B. Therefore, the within group
variability of Group A must be less than that of Group B.

Table 3. Classification of Observations Based on Speaking Fundamental
Frequency (SFF).

                                    Correct Identification (in %)

                                  Group A            Group B
                                   N=50*              N=25*
                                  Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"                         79.1      68.3

Identification of "Unknowns"        30.2      35.0     30.0     10.0

* SFF could not be extracted for all talkers; hence, the identification
scores are based on N=43 for Group A and N=20 for Group B.

Table  3  also  presents  results  of  the  classification  of  the  normal 
test  samples  from  both  groups.   These  data  reveal  much  lower  levels  of 
identification,  about  one-in-three  correct  for  the  normal  productions. 
Thus,  it  would  appear  that  the  SFF  vector  is  unsatisfactory  for  differ- 
entiating talkers  when  used  alone.   Further,  these  scores  are  slightly 
(and  artificially)  inflated  since  the  number  of  subjects  is  smaller  for 
both  groups.   Moreover,  SFF  demonstrates  a  diminished  ability  to  identify 
talkers  in  the  presence  of  the  production  distortions,  stress  and  dis- 
guise.  The  reduction  due  to  stress  is  minimal  but  is  exceptionally 
severe  when  the  talker  is  attempting  a  disguise.   Apparently,  the  ex- 
posure to  a  stressful  situation  only  mildly  alters  SFF.   However, 
speakers  seem  to  be  conscious  of  pitch  as  a  vocal  quality  and  frequently 
change  it  to  disguise  their  voices.   Hence,  SFF  appears  to  be  affected 
in  some  degree  by  stress  and  is  dramatically  altered  during  disguise, 
resulting  in  a  decrease  in  this  vector's  effectiveness. 

The  measures  of  speaking  fundamental  frequency  employed  in  this 
study  were  only  about  one-half  as  effective  as  those  chosen  by  Atal 
(1972). Thus, it would appear that f0 and its variability do not
adequately  characterize  individual  speakers  but  the  talker's  pattern 
of  change  in  fundamental  frequency  throughout  an  utterance  is  somewhat 
unique for an individual.

Speaking Time (ST)

The  ST  vector  is  composed  of  two  measures:  the  amount  of  time 
that  the  talker  is  producing  acoustic  energy  above  a  threshold  and  the 
amount  of  information  he  transmitted  in  each  32-second  interval.   The 
second measure could be extracted in either of two ways: (1) the
number of words and (2) the number of phonemes. These are related
measures; only one need be included in the analysis. It is far easier
to  determine  word  counts  than  phoneme  counts;  however,  if  the  length 
of  words  in  a  particular  reading  is  different  than  the  others,  the 
word  count  will  vary  independently  of  a  speaker's  speech  rate.   To 
examine  whether  word  or  phoneme  counts  were  more  stable  and  thus  pro- 
vide more  accurate  identifications,  discriminant  analyses  were  conducted 
with each parameter used with the other measure in the ST vector. Table
4 contains the results. From this limited amount of data, it is
difficult to make a definitive statement concerning the difference in the
sensitivity of the two parameters to speaker recognition. However,
there is an indication that the phoneme count is a somewhat more stable
measure. Therefore, all other analyses are performed using the number
of phonemes produced per interval -- as well as the measure of the
amount of time acoustic energy is generated during the utterance.

Table  5  shows  the  results  of  the  discriminant  analysis  for  both 
groups  of  speakers  based  on  ST.   Again,  the  posterior  classifications 
of  the  reference  samples  indicate  that  this  vector  does  not  accurately 
characterize  each  speaker.   As  would  be  expected  with  such  poor  group 
definition,  ST,  with  scores  of  12.0  and  20.0%  for  Group  A  and  Group  B, 
respectively,  is  an  unsatisfactory  vector  upon  which  identifications 
are to be made.

Combined Vectors

The  posterior  classification  of  the  reference  samples,  as  listed 
in  Table  6,  suggests  that  combining  the  vectors  (LTS  (passband),  SFF 
and  ST)  improved  the  definition  of  each  group  of  talkers.   Since  all 
posterior classifications produce levels of greater than 90% correct
identification, it would appear that the merging of two or more
individual vectors into a simple, larger vector should provide more accurate
speaker identifications.

Table 4. Classification of Speakers Based on Two Forms of Speaking
Time Vector (N=50).

                  Correct Identification (in %)

Form                Reference      Test

Words                 19.3          8.0
Phonemes              26.7         12.0

Table 5. Classification of Observations Based on Speaking Time (ST).

                                    Correct Identification (in %)

                                   N=50               N=25
                                  Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"                         26.7      56.0

Identification of "Unknowns"        12.0      20.0     12.0     16.0

Table 6. Classification of Observations Based on Various Combinations
of the Limited Passband Long-Term Power Spectra (LTS), Speaking
Fundamental Frequency (SFF) and Speaking Time (ST).

                                    Correct Identification (in %)

                                   N=50*              N=25*
Experimental Condition            Normal    Normal   Stress   Disguise

Posterior Classification
of "Knowns"
  LTS x SFF                        100.0     100.0
  LTS x ST                          99.3     100.0
  SFF x ST                          91.5      98.3
  LTS x SFF x ST                   100.0     100.0

Classification of "Unknowns"
  LTS x SFF                         97.7      85.0     55.0     20.0
  LTS x ST                          84.0      88.0     64.0     28.0
  SFF x ST                          55.8      45.0     40.0     15.0
  LTS x SFF x ST                   100.0     100.0     60.0     15.0

* SFF could not be extracted for all talkers; hence, wherever SFF is
used, identification scores are based on N=43 for Group A and N=20
for Group B.

LTS and SFF. As noted, all analyses involving the SFF vector are
based on 43 rather than 50 speakers (Group A) and 20 rather than 25
speakers (Group B). Therefore, some of the increase in identification
levels shown in Table 6 (for the SFF factor used in combination with
others) may be related to this reduction in the number of speakers.
However, even when this relationship is tempered by such considerations,
it is apparent that the LTS/SFF combination vector improves speaker
identification for normal productions. A comparison of the data
contained in Tables 1 and 3 with those in Table 6 shows that the
identification levels for LTS x SFF increased 21.7% and 5.0% above LTS
alone and 67.5% and 50% above SFF alone for Groups A and B,
respectively. On the other hand, for the stress and disguise
productions, the impact of this combination vector is not clear. First,
relative to SFF alone, the level of identification is approximately
doubled for stress and disguise when LTS is combined with SFF.
Undoubtedly, the change is due to the addition of LTS. Conversely,
however, speaker identification under conditions of stress or disguise
remains essentially unchanged when SFF is added to LTS. The apparent
depression of 5.0% for the stress condition seems to be related to the
reduced speaker population; i.e., since four of the five subjects whose
SFF could not be extracted had been correctly identified using LTS
alone, the correct identification level for the LTS vector on the
population of 20 could be less than was found for the complete group of
25. Under conditions of production distortion, LTS apparently is not
aided by the addition of SFF.
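
The population effect described above can be checked with simple
arithmetic. The sketch below assumes that LTS alone identified 15 of the
25 Group B talkers under stress (60.0%, the figure implied by a 5.0%
depression from the 55.0% for LTS x SFF in Table 6; Table 1 is not
reproduced here) and that four of the five excluded talkers had been
correctly identified. The function name is hypothetical.

```python
def rate_after_removal(correct, total, removed, removed_correct):
    """Percent correct after dropping `removed` talkers, of whom
    `removed_correct` had been correctly identified."""
    return 100.0 * (correct - removed_correct) / (total - removed)

# Assumed counts (see lead-in): 15 of 25 correct with LTS alone under stress.
full_group = 100.0 * 15 / 25                      # 15 of 25 talkers
reduced_group = rate_after_removal(15, 25, 5, 4)  # 11 of 20 talkers
print(full_group, reduced_group)                  # prints 60.0 55.0
```

Under these assumed counts, LTS alone would score 55.0% on the 20
remaining talkers, the same level as LTS x SFF, which is consistent with
the reading that SFF adds nothing to LTS under stress.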
LTS and ST. The results of discriminant analyses based on the
band-limited LTS vector and ST also are listed in Table 6. In this
case, there is an increase in the correct identification rates for each
group for the combined vector relative to when either LTS or ST is used
alone. Again, when comparisons are made of the results shown in Table
6 to those in Tables 1 and 5, this paired vector is slightly better than
LTS alone. Moreover, the large improvement of LTS x ST over ST alone
apparently is attributable to the initial level of LTS. Further, ST
seems to have some impact on the identification of speakers who are
disguising their voices, especially when it is combined with LTS.
Admittedly, the scores are low and may reflect little beyond chance
correct classifications. Nevertheless, LTS x ST appears to be the "best"
vector combination when attempts are made to determine the identity of
a talker producing a disguised voice.

SFF  and  ST.   Again,  these  analyses  are  conducted  with  the  reduced 
subject  population.   The  results  for  this  vector  pair  can  be  seen  in 
Table  6;  the  findings  for  the  separate  vectors  are  contained  in  Tables 
3  and  5.   Identification  of  the  normal  speech  samples  for  both  groups 
is  about  50.0%,  a  level  substantially  better  than  when  either  vector  is 
used  alone.   The  improvement  resulting  from  pairing  implies  that  the 
information contained in SFF is, in fact, different from the information
attributable to ST. As a result, confusions based on the processing of
one vector are sometimes resolved by the other vector. The
identification of speakers who are under stress or attempting to disguise their
voices  remained  at  an  unacceptably  low  level  with  this  vector  pair. 

Due to the reduced number of speakers evaluated when SFF is used,
it is difficult to state the effect of combining these vectors when the
speech is distorted. The changes are slight and may reflect the
removal of a disproportionate number of correctly identified talkers.

LTS and SFF and ST. Finally, all three vectors were combined in
one analysis for each group of speakers. Using the reduced subject
populations, Table 6 shows that all normal test samples were correctly
classified by use of this triple vector. However, LTS (fullband) alone
had achieved this same result with the larger (complete) set of
talkers. The exact number of parameters -- between 12 and 23 -- from
the complete LTS vector necessary to reach all correct classifications
was not tested. However, when the full vector (23 parameters) is used,
all normal samples are properly classified. When using the combined
LTS x SFF x ST, only 15 parameters are used to reach that level of
classification.

Correct  identification  rates  for  stress  and  disguise  are  slightly 
lower  in  some  cases  than  those  observed  for  previous  analyses  using 
subvectors.   However,  as  noted  for  the  LTS  x  SFF  paired  vector,  the 
apparent  reduction  may  result  from  having  removed  a  number  of  correctly 
identified  talkers  from  the  population  when  SFF  could  not  be  extracted. 
In  any  case,  it  must  be  said  that  the  LTS  x  SFF  x  ST  triple  vector  is 
not  capable  of  identifying  talkers  when  the  voices  are  disguised. 
Further, this triad of vectors is only as successful at differentiating
among talkers who are under stress as is the LTS (passband) alone.
Indeed, LTS (fullband) is the most accurate of all vectors or vector
combinations when used to identify stressed speakers.

Further Discussion

Generally, the results of this study indicate that LTS (fullband)
is the most effective acoustic measure upon which to base speaker
identification -- it properly classified all observations of normal
speech productions. Furthermore, LTS (fullband) produced the highest
score for the stress condition and one of the highest for the disguise
condition.

When LTS (fullband) was reduced from 23 to 11 parameters to simulate
a limited passband, the identification level was still markedly better
than that of either of the other two vectors, ST and SFF, when they
were evaluated separately. LTS (passband) also performed at a higher
level (than the other single vectors) for the conditions of stress and
disguise. Moreover, ST appears to be the poorest vector for determining
a talker's identity with one exception, namely, when the speaker is
attempting to disguise his voice. However, the identification level
for ST under conditions of disguise may be little more than a chance
finding. At any rate, while the ST scores used alone are too low to
be of practical utility, the possibility remains that ST may ultimately
provide better discrimination of disguised voices when combined with
other parameters.

The  SFF  parameter,  although  better  than  ST,  also  proved  to  be 
unsatisfactory  when  used  alone.   Indeed,  correct  classification  rates 
were low for both test and reference samples. SFF alone cannot
differentiate talkers but may be useful as an adjunct to another vector.

The relative levels of identification for the pairwise combinations
of the three vectors remain unchanged from their individual
performance. LTS (passband) is the most powerful -- of those fully
tested -- regardless of which other vector is associated with it. And,
in most cases, correct identification rates improve. Further, SFF
generally produces better results than ST when each is paired with LTS,
except that ST and LTS combine to produce the highest identification
rates for disguised voices. The combination of ST and SFF produces
substantially better results than would be predicted from their
individual scores. As previously noted, the information contained in each
set of measures seems to be independent, and their pairing eliminates
some of the confusions present in the separate cases.

Another objective of this study was to determine whether various
statistical analyses of the data produced different results; it was
possible to do a partial evaluation since other analyses have been
carried out on the LTS (fullband). As noted previously, comparisons
were made between Euclidean distance (Majewski and Hollien, 1974),
cross-correlations (Zalewski et al., in press) and a discriminant
analysis. Comparisons could be made only for the LTS vector, and in
this case, discriminant analysis achieved a slightly higher rate of
correct identification than the Euclidean distance and
cross-correlation techniques. While the differences between the three
techniques are not dramatic, the discriminant analysis did produce the
most correct identifications. Further, discriminant analysis is
considered by some researchers (e.g., Bricker et al., 1971) to be
appropriate to the speaker identification problem. Since the
discriminant analysis can be found in general purpose statistical
packages, utilization of this analysis procedure is both convenient and
appropriate. Further, modification of the parameter set, i.e., the
addition, deletion or alteration of the vector, is a straightforward
matter requiring only minor changes in the program statements. To
perform a similar change in a user-generated program could require
substantial changes in the established routine.
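
To make the difference between the two distance-type techniques
concrete, the following sketch assigns an unknown spectrum to the
closest reference talker, once by smallest Euclidean distance and once
by largest correlation. It is a generic illustration, not the procedure
of Majewski and Hollien or of Zalewski et al.: the data and function
names are hypothetical, and the zero-lag Pearson correlation is used
here as an assumed stand-in for the cross-correlation measure.
Discriminant analysis is omitted because it would normally come from a
statistical package.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation(a, b):
    """Pearson correlation (zero-lag normalized cross-correlation)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a)
                    * sum((y - mb) ** 2 for y in b))
    return num / den

def identify(test, references, method="euclidean"):
    """Return the name of the reference talker closest to `test`."""
    if method == "euclidean":
        return min(references, key=lambda n: euclidean(test, references[n]))
    return max(references, key=lambda n: correlation(test, references[n]))

# Hypothetical four-band spectra (dB) for three reference talkers.
references = {
    "talker_1": [52.0, 47.5, 40.1, 33.0],
    "talker_2": [49.3, 50.2, 38.7, 36.4],
    "talker_3": [55.1, 45.0, 42.3, 30.2],
}
unknown = [51.6, 47.9, 40.6, 33.5]

by_distance = identify(unknown, references, "euclidean")
by_correlation = identify(unknown, references, "correlation")
```

With these invented values both criteria select talker_1. The two can
disagree in general, since distance is sensitive to overall level while
correlation responds only to the shape of the spectrum.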


Finally, it should be pointed out that only one of the speech
analysis systems utilized in this research is linked directly to a
computer. This means that all of the values for each of the remaining
parameters must be recorded manually and handled separately for
computer processing. If the results of each type of analysis could be
fed directly to a computer, then all three vectors could be extracted
simultaneously and stored temporarily. Simple listings or additional
processing then could be conducted rather easily and without the
susceptibility to human errors that exists in the present system.


IV 
SUMMARY  AND  CONCLUSIONS 

The primary concern of this research was to evaluate the
effectiveness of specified vectors -- either singly or in various
combinations -- to correctly identify talkers from their speech alone.
The effectiveness of the acoustic/temporal characteristics of the voice
(long-term speech spectra, speaking fundamental frequency and speaking
time) was evaluated with and without distortions. The distortions
included a limited frequency band for transmission, and stress and
disguise for production.

The results indicate that LTS is generally the most powerful vector
upon which to base speaker identifications. In most cases, SFF,
although  not  satisfactory,  seems  to  be  a  better  discriminator  than  ST. 
With  few  exceptions,  speaker  identification  improved  when  vectors  were 
combined.   Finally,  although  ST  performed  poorly,  it  may  be  a  useful 
adjunct  to  other  parameters  in  identifying  talkers  who  are  trying  to 
disguise  their  voices. 

Several conclusions may be drawn from this study. It must be
concluded that, since none of the vectors or combinations of vectors
could sustain high levels of identification for all conditions of
speech production, the vectors examined in this study are not measuring
an invariant characteristic of a speaker's voice. Obviously, the
production of speech signals under stress or disguise results in
significant changes in spectral composition, fundamental frequency and
speaking time. Considering the level of identification observed with
LTS, speakers may operate within a limited portion of their range
during most of their speech. However, noting the decrease in
identifications when their productions are distorted, it is obvious
that they -- either actively or passively -- modify the speech signal
for each of the acoustic/temporal factors.

Secondly, ST seems to have a greater capacity to determine the
identity of talkers who are attempting a disguise than would be
expected from its performance with normals. If the results under
disguise are not a chance finding, it would appear, as hypothesized,
that talkers do not attend to that temporal factor while they are
attempting to disguise their voices. For this reason, additional
research seems warranted to improve the overall performance level of ST.

In view of the observed speaker identification levels, the fullband
LTS vector used alone seems to be the best procedure for identifying
talkers, i.e., it produces the highest correct identification rates
except when a disguise is attempted. Therefore, unless further
research can demonstrate that SFF and ST can produce acceptable
identification levels -- even if only for specified conditions -- it
would appear sufficient to extract only the LTS vector from the speech
signal.

Finally, LTS (fullband) was selected to test for possible
differences in speaker identification resulting from the use of a
specific analysis technique. For reasons of appropriateness, accuracy
and ease of data handling, the packaged discriminant analysis seems
more effective than the Euclidean distance method of Majewski and
Hollien (1974) or the cross-correlation technique of Zalewski et al.
(in press).


APPENDIX 


Adapted  from:   An  Apology  For  Idlers 
By  Robert  Louis  Stevenson 


If you look back on your own education, I am sure it will not
be the full, vivid, hours of truancy that you regret. You would
rather cancel out some of the lack-luster periods between sleep and
waking that you experienced in school. For my own part, I have
attended a good many lectures in my time. I still remember that the
spinning of a top is a case of kinetic stability. But though I would
not willingly part with such scraps of science, I do not set the same
store in them as by certain other odds and ends that I came upon in
the open street while I was playing truant.

Extreme  busyness,  whether  at  school  or  college,  church  or  market, 
is  a  symptom  of  deficient  vitality.   A  faculty  for  idleness  implies 
a catholic appetite and a strong sense of personal identity. There
are a sort of dead-alive, hackneyed people about, who are scarcely
conscious of living except in the exercise of some conventional
occupation. Bring these fellows into the country, or set them on board
ship, and you will see how they pine for their desk or their study.
They  have  no  curiosity;  they  cannot  give  themselves  over  to  random 
provocations nor do they take pleasure in the exercise of their
faculties for its own sake. Unless necessity lays about them with a
stick,  they  will  even  stand  still.   It  is  no  good  speaking  to  such 
folk.   They  cannot  be  idle;  their  nature  is  not  generous  enough.   They 
pass  those  hours,  which  are  not  dedicated  to  furious  toiling  in  the 
gold-mill,  in  a  sort  of  coma.   When  they  do  not  require  to  go  to  the 
office,  when  they  are  not  hungry  or  have  no  mind  to  drink,  the  whole 
breathing  world  is  a  blank  to  them.   If  they  have  to  wait  an  hour  or 
so  for  a  train,  they  fall  into  a  stupid  trance  with  their  eyes  open. 
To  see  them  you  would  suppose  there  was  nothing  to  look  at  and  no  one 
to  speak  with.   You  would  imagine  they  were  hypnotized  or  frozen.   Yet, 
very  possibly  they  are  hard  workers  in  their  own  way,  and  have  good 
eyesight  for  a  flaw  in  a  deed  or  a  turn  of  the  market.   They  have  been 
to  school  and  college,  but  during  all  that  time  they  had  their  eye  only 
on  their  grades.   They  have  gone  about  in  the  world  and  mixed  with 
clever people, but all the time they were thinking of their own
affairs. As if a man's soul were not too small to begin with, they have
dwarfed and narrowed theirs by a life of all work and no play. Here
they are forty, with a listless attention, a mind vacant of all
material of amusement, and not one thought to rub against another while they
wait  for  that  train.   Before  he  grew  up  he  might  have  clambered  on  boxes. 
When  he  was  twenty  he  would  have  stared  at  the  girls.   But  now  the  pipe 
is  smoked  out,  the  snuff-box  empty,  and  my  gentleman  sits  bolt  upright 
upon  a  bench  with  vacant  eyes.   This  does  not  appeal  to  me  as  being  a 
"Success  in  Life." 

But  it  is  not  only  the  person  himself  who  suffers  from  his  busy 
habits,  but  his  wife  and  children,  his  friends  and  relations,  and  even 
the  very  people  he  sits  with  in  a  railway  carriage  or  a  bus.   Perpetual 
devotion  to  what  a  man  calls  his  "business"  is  only  to  be  sustained  by 
perpetual neglect of many other things. In fact, it is not by any
means  certain  that  a  man's  "business"  is  the  most  important  thing  he 
has  to  do. 